A METHOD FOR OBTAINING CONSENSUS CLASSIFICATIONS AND IDENTIFICATIONS
BY COMBINING DATA FROM DIFFERENT EXPERIMENTS 

BACKGROUND TO THE INVENTION

Classification techniques have been used in the field of biology to determine relationships and
patterns in large sets of data. A means for placing an unidentified organism into a
classification group which reflects its genotype is important for understanding the
organism and its features, particularly in the treatment and study of disease and in the
assessment of biodiversity. Information which identifies the closest known relation to an
unknown disease-causing bacterial or viral strain would enable a physician to prescribe the
most appropriate course of treatment based on such knowledge. Similarly, information which
could, for example, enable a group of patients suffering from a disorder that has several
genetic and phenotypical traits to be categorised into nearest-relation groups would enable
subpopulations of patients to receive the appropriate treatment and/or enable a research
scientist to discern the most appropriate set of genes and associated biochemical pathways
for further study. More particularly, the study of infections and epidemic diseases, for
example, in hospitals or caused by distributed food or contaminated water, can benefit from
adequate classification of the causative agents, so that the sources of an outbreak or
infection can be traced and eradicated. The development of DNA chips and microarrays has
increased the amount of data available for analysis, and increased the need for methods that
can rapidly and reliably analyse raw experimental data.

Classification techniques have been described in the art. PCT patent application numbers
WO 01/20536 and WO 01/73602 disclose methods for producing hierarchical clusterings from
large sets of biological data. PCT patent application WO 01/45026 discloses a method for
displaying data resulting from a consensus classification analysis in a visually
comprehensible format.

While methods to determine classifications have been developed to deal with the processing
of large amounts of experimental data, the problem of the accuracy of the classifications
remains. Present methods do not take account of the quality of the experimental data
from which classifications are made, and hence the classifications so produced are distorted,
leading to incorrect analyses which may seriously affect the course of a treatment regimen. A
further problem with conventional classification methods is that they are unable to provide a
classification when data is missing, or they ignore the fact that an item of data is missing and
distort the consensus classification so produced. Distorted and incorrect consensus
classifications have serious implications for medical research and for the treatment of
diseases. Thus there are time/cost and health benefits in finding a method which can produce
accurate consensus classifications.

AIMS OF THE INVENTION 

It is an aim of the present invention to provide a method suitable for producing accurate
classifications and which overcomes the limitations of the prior art. It is further an aim of the
present invention to provide an apparatus for producing accurate classifications and which
overcomes the limitations of the prior art.

DETAILED DESCRIPTION OF THE INVENTION 

As used herein, the term "classification" refers to the classification of organisms, or of the
data from experiments performed on said organisms or samples thereof.

The term "consensus classification" refers to a classification of organisms, or of the data from
experiments performed on said organisms or samples thereof, wherein the data is derived
from two or more techniques, and wherein said techniques produce one or more data types.

A consensus classification may show the degree of relationship between organisms or
samples thereof. Consensus classification may indicate the hierarchical ordering within a set
of organisms or samples thereof. Consensus classifications may provide information
necessary to construct a dendrogram or grouping by means other than cluster analysis.

By "organism" herein is meant any animal or plant, any cell, bacterium, yeast, phage, virus or
prion; it includes organisms of any genus, species, sub-type, biotype, phenotype or genotype.

By "sample" herein is meant a portion which is intended to represent the whole. Non-limiting
examples of samples are environmental samples such as soil or water, or clinical samples
such as blood, sputum, stool or urine. Samples may contain many organisms as defined above.
Samples may comprise extracts of organisms such as their DNA, proteins, glycoproteins, or
other quantifiable substances.

In one embodiment of the present invention, data derived from experiments is classified
according to the "data type" that is produced by the result of the experiment.

In one embodiment of the present invention, the data type is one or more "binary characters".
By "binary characters" is meant that the result(s) from an experiment is one of two outcomes, said
result recordable as a binary character. For example, the outcome of the experiment might be
either "present" or "absent", "positive" or "negative", "high" or "low", "dark" or "bright",
recordable as a binary 1 or 0.

In another embodiment of the present invention, the data type is one or more "continuous-scale
characters". By "continuous-scale characters" is meant that the result(s) from an
experiment is a value which reflects a magnitude, said result recordable as a continuous-scale
character such as, for example, a decimal number between -100 and +100. This
category includes, but is not limited to, data from experiments to measure concentrations,
kinetic properties, intensities, etc.

In another embodiment of the present invention, the data type is one or more "multistate or
categorical characters". By "multistate or categorical characters" is meant that the result(s) from
an experiment is a state or category, said result recordable as a state or category name-tag.
This category includes, but is not limited to, data from experiments which result in
a colour (e.g. category names red, blue, yellow, black, green) or a shape (e.g. category names
cube, sphere, cylinder), but also sequence-related genotypic properties such as genotypes
resulting from Multilocus Sequence Typing (MLST), Variable Number of Tandem Repeats
(VNTR) typing, microsatellite analysis, DNA conformational structure analysis (Single Strand
Conformational Polymorphism and heteroduplex electrophoresis), Single Nucleotide
Polymorphism (SNP) analysis, etc. Binary characters may be interchangeable with multistate
or categorical characters when the result of an experiment is one of two categories.

In another embodiment of the present invention, the data type is one or more "product size
or retention time characters". By "product size or retention time characters" is meant that the
result(s) from an experiment is the magnitude of one property recorded as a function of the
magnitude of a second property, recordable as a character. This category includes, but is not
limited to, data from experiments which result in an electrophoresis gel, a high performance
liquid chromatography (HPLC) chromatogram, a gas chromatogram, a capillary
electrophoresis chromatogram, a paper chromatogram, a thin-layer chromatogram, or the
output of mass spectrometry techniques such as MALDI-TOF.

In another embodiment of the present invention, the data type is one or more "sequence
characters". By "sequence characters" is meant that the result(s) from an experiment is a
sequence of characters corresponding to, for example, DNA, RNA and/or amino acid
sequences, recordable as a sequence of characters.

In one embodiment of the present invention, characters of data derived from experimental
tests, observations or recordings taken from samples of organisms (e.g. intra and/or
extracellular material from one or more organisms) are collected.

In one embodiment of the present invention, said data is recorded as an "experimental data
matrix".

As used herein, an "experiment" refers to an experimental action or set of experimental
actions which leads to an observation or a set of observations that can be recorded as a
dimensional array of measurements for a single organism, sample or genotype. Non-limiting
examples of experiments include measuring cell length, determining an antibiotic resistance
profile, incubating a microtiter plate to measure 96 enzymatic activities, running an RFLP
electrophoresis pattern or a two-dimensional gel, and obtaining a DNA sequence.

As used herein, an "experimental data matrix" means a set of results from one or more
individual experiments of the same data type. A non-limiting example of an experimental data
matrix is given in Figure 1-A, wherein four different experiments which produce results of the
same data type (experiments 1 to 4) have been performed on samples from four different
organisms (genotype 1 to 4). While the results are presented as a matrix in Figure 1-A, the
term "matrix", as used herein, does not limit the data to the matrix format; data may be
recorded or presented in any format. Non-limiting examples of data formats include
dimensional arrays, lists, coded data, data.

As used herein, the term "similarity value" means a value which indicates the relative
similarity or distance between two or more samples, organisms, and/or genotypes. The
similarity value can be a direct result from the experiment (for example, DNA-DNA
hybridizations) or can be the result of the comparison of two experimental data sets by means
of a similarity or distance coefficient, known in the art. Since similarity values and distance
values can be easily converted into one another, the distinction between both types is only
morphological. Therefore the term "similarity value" is used throughout this document to
include both similarity and distance values, except when a distance value is explicitly meant.
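By way of non-limiting illustration, the interconversion of similarity and distance values can be sketched as follows. The complement convention used here (distance = scale - similarity) is an assumption; it is one common choice and is not prescribed by the text. The function names are hypothetical.

```python
def similarity_to_distance(similarity, scale=1.0):
    """Convert a similarity value to a distance on the same scale (assumed convention)."""
    return scale - similarity


def distance_to_similarity(distance, scale=1.0):
    """Inverse of similarity_to_distance for the same scale."""
    return scale - distance


# Under this convention a similarity of 75% corresponds to a distance of 25%.
distance = similarity_to_distance(75.0, scale=100.0)
```

Because the conversion is its own inverse, applying it twice returns the original value, which is why the two kinds of value can be treated interchangeably.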

As used herein, the term "similarity matrix" means a set of similarity values derived from two
or more individual experiments of the same data type. The term "matrix", as used herein, does not
limit the data to the matrix format; data may be recorded or presented in any format,
including one-dimensional, two-dimensional, and multi-dimensional arrays. Since
similarity matrices and distance matrices can be easily converted into one another, the
distinction between both types is only morphological. Therefore the term "similarity matrix" is
used throughout this document to include both similarity and distance matrices, except when
a distance matrix is explicitly meant.

In one embodiment of the present invention, the experimental data matrix is used to calculate
a similarity or distance between organisms, resulting in a similarity matrix as defined above. A
non-limiting example of a similarity matrix is provided in Figure 1-B, wherein the similarity
matrix has been calculated from the experimental data matrix shown in Figure 1-A. Similarity
matrices may be calculated from experimental data matrices using techniques known in the
art.
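By way of non-limiting illustration, the calculation of a similarity matrix from an experimental data matrix of binary characters can be sketched as follows. The simple matching coefficient is used here as one coefficient known in the art; the choice of coefficient, the function names and the data values are assumptions and are not taken from Figure 1-A.

```python
def simple_matching(a, b):
    """Fraction of characters on which two binary profiles agree."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of characters")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_matrix(data):
    """Return the symmetric matrix of pairwise similarities between rows."""
    n = len(data)
    return [[simple_matching(data[i], data[j]) for j in range(n)] for i in range(n)]


# Hypothetical experimental data matrix: rows are genotypes, columns are binary characters.
data = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
S = similarity_matrix(data)
```

Each element S[i][j] compares genotype i with genotype j, so the diagonal is 1 and the matrix is symmetric.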

It is within the scope of the present invention to use a similarity matrix to present
classifications between, for example, samples, organisms and/or genotypes. In a non-limiting
example, a clustering algorithm might be applied to a similarity matrix to produce a
dendrogram or tree. Figure 1-C provides an example of a dendrogram produced from the
similarity matrix shown in Figure 1-B.
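By way of non-limiting illustration, one clustering algorithm that might be applied to a similarity matrix is average-linkage clustering (UPGMA), sketched below. The function name, the nested-tuple tree representation and the similarity values are assumptions made for illustration and are not taken from Figures 1-B or 1-C.

```python
def upgma(names, sim):
    """Average-linkage clustering on a similarity matrix.

    Returns a nested tuple (left, right, similarity) describing the tree.
    """
    # Each cluster is (tree, indices of the original entries it contains).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def average_similarity(a, b):
        # Mean similarity over all pairs of original entries from the two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_similarity(clusters[x][1], clusters[y][1])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merged = ((clusters[x][0], clusters[y][0], s), clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0][0]


# Hypothetical similarity matrix (percent) for four organisms.
names = ["A", "B", "C", "D"]
sim = [
    [100, 90, 30, 30],
    [90, 100, 30, 30],
    [30, 30, 100, 80],
    [30, 30, 80, 100],
]
tree = upgma(names, sim)
```

With these values, A and B join first at 90%, C and D join at 80%, and the two groups join at 30%.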

As used herein, a "composite similarity matrix" refers to a new data matrix resulting from the
combination of two or more similarity matrices. A composite similarity matrix may or may not
be derived from a single data type. In one non-limiting example, a composite similarity matrix
is formed from the combination of two or more similarity matrices derived from the same data
type. In another non-limiting example, a composite similarity matrix is formed by combining
similarity matrices derived from different experimental data types, e.g. a similarity matrix
derived from results of a DNA electrophoresis gel of a restriction digest with a similarity matrix
derived from quantification of mRNA expression levels.

One aspect of the present invention is a method that can generate a consensus classification 
from the combination of two or more similarity matrices by constructing a composite similarity 
matrix. 

The invention encompasses methods which may result in a means to view said classifications 
such as, for example, a dendrogram, a Principal Components Analysis (PCA), a Self- 
Organizing Map (SOM), a Discriminant Analysis (DA), and other known grouping techniques. 

The invention encompasses methods which may result in a means to view phylogenetic 
evolution, mutational evolution or clonal relationships such as, for example, a phylogenetic 
tree, an average distance tree, a minimum spanning tree, or any other graphical 
representation that visualises clonal or evolutionary relationship. 

The invention encompasses methods which may evaluate the quality of the resulting 
dendrogram, grouping or display of evolution by means of a general quality score or quality 
score at each branching point (co-phenetic correlation, standard deviations, Jackknife and 
Bootstrap tests) and the invention can use these values to optimise and steer an analysis 
procedure. 

The inventors have found that the use of composite similarity matrices to produce a
consensus classification leads to a surprising increase in the accuracy of the classification so
produced. The inventors have further found that, when similarity matrices are combined
according to a method of the invention, including when diverse experimental data types are
available, such as, for example, similarity matrices derived from DNA electrophoresis
experiments and similarity matrices derived from enzymatic or metabolic activity assay
studies, the classification so produced is surprisingly more accurate compared with
conventional methods of the art.

One embodiment of the present invention is a method suitable for producing a consensus
classification of organisms using the data derived from two or more experiments performed
on said organisms or samples thereof, comprising the steps of:

i) obtaining similarity matrices from the said data,
ii) producing a composite similarity matrix that is a function of said similarity matrices, and
iii) producing a consensus classification from said composite similarity matrix.

Another embodiment of the present invention is a method as defined above wherein the
function of step ii) comprises averaging the corresponding elements of said similarity
matrices.

Another embodiment of the present invention is a method as defined above wherein each
similarity matrix is weighted according to the number of experimental characters used to
calculate said matrix, to arrive at the average.

Another embodiment of the present invention is a method as defined above wherein each 
similarity matrix is weighted by a user defined value to arrive at the average. 

Another embodiment of the present invention is a method as defined above, wherein said
experiments produce product size or retention time results, and wherein each element of
each similarity matrix is weighted according to the number of bands or features associated
with that element, to arrive at the average.

Another embodiment of the present invention is a method as defined above wherein said
experiments are any of electrophoresis, high performance liquid chromatography, gas
chromatography, capillary electrophoresis, chromatography, thin-layer chromatography,
and/or mass spectrometry.

Another embodiment of the present invention is a method as defined above wherein the
function of step ii) comprises the steps of:

a) linearising said similarity data matrices, and

b) averaging the corresponding elements of said linearised similarity matrices of step a).

Another embodiment of the present invention is a method as defined above wherein step a)
comprises the minimisation of equations:

[equations illegible in source]

wherein $p$ is the number of organisms, samples or genotypes, wherein each technique $k$
results in a matrix of pair-wise distance values, so that the distance value obtained between
organism $i$ and $j$ from technique $k$ is given by $d_{k,ij}$, [normalisation term illegible in
source], wherein the consensus distance matrix $D_{ij}$ is considered as the
unknown true universal distance scale and wherein the goal is to search the consensus
distances $D_{ij}$, $g_k$ and $f_k$ so that $d_{k,ij} \cong f_k(D_{ij})$ and
$D_{ij} \cong g_k(d_{k,ij})$ hold as true as possible.

Another embodiment of the present invention is an apparatus suitable for performing the 
methods as defined above. 

Another embodiment of the present invention is a computer program comprising a computing
routine, stored on a computer readable medium, suitable for producing a consensus
classification of organisms using the data derived from two or more experiments performed
on said organisms or samples thereof according to the methods as defined above.



Another embodiment of the present invention is a device suitable for producing a consensus
classification of organisms using the data derived from two or more experiments performed
on said organisms or samples thereof according to the methods as defined above.



Averaging of similarity matrices

In one embodiment of the present invention, a method of obtaining a consensus classification 
comprises the steps of: 

i) calculating two or more similarity matrices from two or more experimental data matrices,
using methods known in the art, wherein the data type of one experimental data matrix may
be the same as that of other experimental data matrices in the calculation, or the data type of
one experimental data matrix may be different from that of other experimental data matrices
in the calculation,

ii) calculating a composite similarity matrix by averaging the corresponding elements of the 
respective matrices, i.e. between the same pairs of organisms or samples, and 

iii) calculating a consensus classification from the composite similarity matrix of step ii). 

According to one embodiment of the invention, a composite similarity matrix is calculated by
averaging the corresponding elements of the respective similarity matrices. This means, for
example, in the case of element 1, that the sum of element 1 from similarity matrix 1, element 1
from similarity matrix 2, element 1 from similarity matrix 3, etc. is divided by the number of
matrices, so producing an average of element 1 for the similarity matrices used in the
calculation.

In one embodiment of the present invention, a method of obtaining a consensus classification 
comprises the steps of: 

i) calculating two or more similarity matrices from two or more experimental data matrices, 
using methods known in the art, wherein the data type of one experimental data matrix may 
be the same as that of other experimental data matrices in the calculation, or the data type of 
one experimental data matrix may be different from that of other experimental data matrices 
in the calculation, 

ii) calculating a composite similarity matrix by averaging the corresponding elements of the 
respective matrices, i.e. between the same pairs of organisms or samples, wherein each 
element is weighted by user-defined parameters, and 

iii) calculating a consensus classification from the composite similarity matrix of step ii). 

In one embodiment of the present invention, a method of obtaining a consensus classification 
comprises the steps of: 

i) calculating two or more similarity matrices from two or more experimental data matrices,
using methods known in the art, wherein the data type of one experimental data matrix may
be the same as that of other experimental data matrices in the calculation, or the data type of
one experimental data matrix may be different from that of other experimental data matrices
in the calculation,

ii) calculating a composite similarity matrix by averaging the corresponding elements of the
respective matrices, wherein each similarity matrix is weighted according to the number of
experimental characters used to calculate said matrix, and

iii) calculating a consensus classification from the composite similarity matrix of step ii). 

The inventors found that the averaging and weighting method of step ii) in the above 
embodiments provides a surprising increase in the accuracy of the resulting consensus 
classification. 

According to one embodiment of the present invention, two or more similarity matrices are
averaged according to equation [1], wherein $S_c$ is the resulting composite similarity matrix, $S_1$
is similarity matrix 1, $S_2$ is similarity matrix 2, and $S_n$ is similarity matrix $n$. Equation [1] is an
example of combining two or more similarity matrices using unweighted averages.

$$S_c = \frac{S_1 + S_2 + \dots + S_n}{n} \qquad [1]$$

A non-limiting example of the embodiment is provided in Figure 2-A wherein similarity matrix
$S_1$ is derived from 15 experimental character types and similarity matrix $S_2$ is derived from 6
experimental character types. The composite similarity matrix so formed is calculated
according to equation [1].
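By way of non-limiting illustration, the unweighted averaging of equation [1] can be sketched as follows. The function name and the similarity values are hypothetical and are not taken from Figure 2-A.

```python
def average_matrices(matrices):
    """Element-wise unweighted mean of equally sized square matrices (equation [1])."""
    if not matrices:
        raise ValueError("at least one similarity matrix is required")
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(size)]
        for i in range(size)
    ]


# Hypothetical similarity matrices (percent) for the same three organisms.
S1 = [[100, 80, 40], [80, 100, 50], [40, 50, 100]]
S2 = [[100, 60, 20], [60, 100, 30], [20, 30, 100]]
Sc = average_matrices([S1, S2])
```

Each element of Sc is the mean of the elements at the same position in S1 and S2, i.e. between the same pair of organisms.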

According to another embodiment of the present invention, two or more similarity matrices are
averaged according to equation [2], wherein $S_c$ is the resulting composite similarity matrix, $S_1$
is similarity matrix 1, $a$ is the number of elements used in the experimental data matrix to
calculate similarity matrix 1; $S_2$ is similarity matrix 2, $b$ is the number of elements used in the
experimental data matrix to calculate similarity matrix 2; $S_n$ is similarity matrix $n$, $p$ is the
number of elements used in the experimental data matrix to calculate similarity matrix $S_n$.
Equation [2] is an example of an equation for combining two or more similarity matrices using
weighted averages.

$$S_c = \frac{aS_1 + bS_2 + \dots + pS_n}{a + b + \dots + p} \qquad [2]$$

An example of combining similarity matrices using weighted averages based on equation [2]
is provided in Figure 2-B, wherein similarity matrix $S_1$ was calculated from an experimental
data matrix containing 15 elements and similarity matrix $S_2$ was calculated from an
experimental data matrix containing 6 elements. Using weighted averaging, the composite
similarity value $S_c$ is calculated according to equation [2].
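By way of non-limiting illustration, the weighted averaging of equation [2] can be sketched as follows, using the weights 15 and 6 mentioned for Figure 2-B. The function name and the similarity values themselves are hypothetical and are not taken from the figure.

```python
def weighted_average_matrices(matrices, weights):
    """Element-wise weighted mean of equally sized square matrices (equation [2])."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("one weight is required per similarity matrix")
    total = sum(weights)
    size = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
         for j in range(size)]
        for i in range(size)
    ]


# Hypothetical similarity matrices (percent) for the same two organisms.
S1 = [[100, 80], [80, 100]]  # assumed to be derived from 15 elements
S2 = [[100, 50], [50, 100]]  # assumed to be derived from 6 elements
Sc = weighted_average_matrices([S1, S2], [15, 6])
```

Here the off-diagonal composite value is (15 x 80 + 6 x 50) / 21, approximately 71.4%, whereas an unweighted average of the same values would give 65%; the matrix based on more elements therefore contributes more to the result.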

The averaging method according to equation [2] provides a surprising improvement in the 
accuracy of the consensus classification so-produced. 

In another embodiment of the present invention, the values of a, b, ... p in equation [2] are
determined by the user. The averaging method wherein the user determines the values of a,
b, ... p may also provide an improvement in the accuracy of the consensus classification so
produced, when conducted by an experienced user who is familiar with the data types
being analysed.

In one embodiment of the present invention, the similarity matrix calculated from experimental
data comprising characters belonging to the "product size or retention time" data type, for
example DNA electrophoresis gels, is calculated by comparing data sets two by two. Figure
3-A illustrates a non-limiting example wherein the similarity values are calculated from DNA
electrophoresis fragment patterns resulting from a restriction digest using a restriction
enzyme (experiment 1) and using DNA from three different organisms (genotype 1, 2 and 3).
In Figure 3-A, every matching band is considered as one band occurrence, and every
unmatched band on either pattern is also a band occurrence. For example, in comparing
genotype 1 and genotype 2, genotype 1 contains 3 bands, genotype 2 contains 2 bands, and 2
bands match; there are therefore 3 band occurrences (2 matched and 1 unmatched), and the
similarity coefficient is calculated as 2/3 or 66.667%. Thus the
similarity matrix derived from experimental data of the "product size or retention time" data
type is built up, according to a method of the present invention, by comparing data sets two
by two.
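By way of non-limiting illustration, the pairwise band comparison described above can be sketched as follows. Matched bands divided by band occurrences corresponds to the Jaccard coefficient. Representing each pattern as a set of band positions that match only when exactly equal is a simplifying assumption, and the function name and band positions are hypothetical.

```python
def band_similarity(bands_a, bands_b):
    """Return (similarity, band occurrences) for two band patterns.

    Every matched band counts as one occurrence, and every unmatched band
    on either pattern also counts as one occurrence.
    """
    matched = len(bands_a & bands_b)
    occurrences = len(bands_a | bands_b)
    if occurrences == 0:
        return 0.0, 0
    return matched / occurrences, occurrences


# Hypothetical band positions: genotype 1 has 3 bands, genotype 2 has 2, and 2 match.
genotype_1 = {500, 300, 100}
genotype_2 = {500, 300}
similarity, occurrences = band_similarity(genotype_1, genotype_2)
```

The function also returns the number of band occurrences, since that count is the quantity associated with each similarity value when patterns from several experiments are later combined.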


In another embodiment of the present invention, similarity matrices derived from experimental
data matrices containing characters of the "product size or retention time" data type may be
combined to form a composite similarity matrix by taking the number of bands or features
associated with each element of each similarity matrix for use as a weighting constant of that
element.

A method of the present invention for combining similarity matrices derived from experimental
data matrices containing characters of the "product size or retention time" data type
comprises the use of equation [3]. Equation [3] assumes that $N$ product size or retention time
data types are combined into a composite similarity matrix, wherein $S_{c,e}$ is the value
of element $e$ of the composite similarity matrix ($S_c$), $S_{i,e}$ is the value of element $e$ of the
similarity matrix from product size or retention time data type $i$, and $P_{i,e}$ is the number of
experimental features associated with that element.

$$S_{c,e} = \frac{\sum_{i=1}^{N} P_{i,e}\, S_{i,e}}{\sum_{i=1}^{N} P_{i,e}} \qquad [3]$$

Figure 3-B shows a non-limiting example whereby a composite matrix is calculated from
similarity matrices derived from different DNA restriction digest results. Box [1] shows the
results of a DNA restriction digest upon the DNA of the organisms of Figure 3-A (Gen 1, 2
and 3), using a different restriction enzyme (exp 2). Box [2] shows the similarity matrix
produced therefrom, using the method described for Figure 3-A, and Box [2] also shows the
similarity matrix resulting from exp 1. The elements or similarity coefficients of the composite
similarity matrix shown in Box [3] are calculated using equation [3]. For example, the
composite similarity value between Gen 1 and Gen 2 (83%) is calculated by weighting the
similarity value between Gen 1 and Gen 2 measured in exp 1 (66%) by the number of bands
associated with that measurement (3 bands); weighting the similarity value between Gen
1 and Gen 2 measured in exp 2 (88%) by the number of bands associated with that
measurement (9 bands); summing the weighted values (3x66 + 9x88); and dividing by the
total number of bands (12 bands) to arrive at the composite similarity value (83%).

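By way of non-limiting illustration, the calculation of one element of the composite similarity matrix can be sketched as follows, using the values quoted for Gen 1 and Gen 2 (66% over 3 bands and 88% over 9 bands). The function name is hypothetical, and returning None when no bands are available is an assumption.

```python
def composite_element(similarities, band_counts):
    """Weight each similarity value by its number of bands and normalise by the total."""
    total = sum(band_counts)
    if total == 0:
        return None  # assumption: no bands at all means no composite value
    return sum(s * p for s, p in zip(similarities, band_counts)) / total


# Gen 1 versus Gen 2: exp 1 gives 66% over 3 bands, exp 2 gives 88% over 9 bands.
value = composite_element([66, 88], [3, 9])
```

The result is (3 x 66 + 9 x 88) / 12 = 82.5, which corresponds to the composite similarity value reported as 83%.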
The inventors have found that, when similarity matrices are combined according to the methods
of the invention described above, including when data types such as those from DNA
electrophoresis experiments are available, the consensus classification so produced is
surprisingly more accurate compared with conventional methods of the art.

According to another embodiment of the present invention, the averaging of similarity
matrices to produce a composite similarity matrix as described above may be generally
applied to similarity matrices derived from any combination of experimental data sets or data
matrices. It may include the combination of a similarity matrix derived from one data type with
a similarity matrix derived from another data type. It may also include the combination of a
similarity matrix derived from one data type with a similarity matrix derived from the same
data type. It may also include the combination of more than two similarity matrices, each
derived from a different data type.

Linearisation of similarity matrices

The method described above to create a composite similarity matrix based on the weighted or
unweighted average of individual similarity or distance matrices works well when the
expected discriminatory range for both techniques is comparable, and when the matrices are
complete, i.e. for each experiment there is a similarity value present for each pair of entries.
However, when two experimental techniques are performed on the same set of organisms,
and they generate strongly different similarity or distance levels, the composite similarity
matrix formed using the method disclosed above can be distorted. For example, DNA
hybridisation and 16S rDNA sequencing are techniques whose discriminatory ranges are
different. Figure 5A compares the distance matrices from DNA hybridisation values and 16S
rDNA gene sequences. On the scale of 16S rDNA sequence distance (the X-axis), the DNA
hybridisation-based distances occupy a narrow range close to zero distance, whereas on the
scale of DNA hybridisation (the Y-axis), 16S rDNA sequence distances occupy a narrow
range on the most distant side of the scale. This effect is due to the nonlinear relation
between both matrices. Both matrices can be considered as zoomed windows on different
ranges of a hypothetical linear distance scale. Thus the averaging of similarity matrices
derived from experiments with different discrimination ranges can lead to a distorted
composite similarity matrix.

The averaging of two similarity or distance matrices with different ranges, wherein one or both
matrices contain missing values, can lead to even stronger distortion. For example, Figure 5B
shows two similarity matrices of values from DNA homology and 16S rDNA sequence
experiments on the same organisms (A, B, C and D). As shown, DNA homology values range
from 88% to 40%, whereas 16S rDNA gene identity ranges between 98% and 88%. In
addition, the DNA hybridization homology matrix has some missing elements. In spite of
these missing elements, it is clear from the DNA hybridization homology matrix alone that the
four genotypes analysed consist of two groups: [A,B] and [C,D]. The 16S rDNA sequence
identity matrix also suggests the same groupings, although at a different scale.

The composite similarity matrix created from these two techniques shows averaged values for
[AB], [BC], [BD], and [CD], but for [AC] and [AD] it has taken the only available values, 90%
and 88%, respectively. The resulting matrix provides a distorted view of the relationships
between these organisms, as it suggests [AC] and [AD] to be at least as closely related
as [AB]. The resulting UPGMA dendrogram also depicts a different classification as compared
to the two dendrograms derived from DNA hybridisation and 16S rDNA sequence identity
individually.
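By way of non-limiting illustration, the distortion caused by averaging incomplete matrices with different ranges can be sketched as follows. Only the [AC] and [AD] values of 90% and 88% and the overall ranges are taken from the text; all other values are hypothetical, chosen to lie within the stated ranges, and are not the values of Figure 5B. The function and variable names are likewise assumptions.

```python
def average_available(values):
    """Average the values that are present; None marks a missing value."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical DNA homology values (percent); [AC] and [AD] are missing.
dna = {"AB": 80, "AC": None, "AD": None, "BC": 40, "BD": 45, "CD": 88}
# Hypothetical 16S rDNA identity values (percent), complete.
rrna_16s = {"AB": 96, "AC": 90, "AD": 88, "BC": 89, "BD": 88, "CD": 98}

composite = {pair: average_available([dna[pair], rrna_16s[pair]]) for pair in dna}
```

With these values the composite gives [AB] = 88, [AC] = 90 and [AD] = 88, so A appears at least as close to C and D as to B, even though each technique on its own groups A with B. The missing pairs inherit the compressed 16S rDNA scale unchanged, while the other pairs are pulled down by the wider DNA homology scale.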

One aspect of the invention is a method for constructing a composite similarity matrix that
combines the information present in the individual matrices in a way that the useful
information from each of the constituent matrices is optimally preserved, i.e. by respecting the
particular discriminatory ranges of the techniques applied.

In one embodiment of the present invention, similarity matrices are combined by averaging to 
form a composite similarity matrix wherein the constituent similarity matrices are linearised 
prior to averaging. 

By "linearise" in reference to a similarity matrix herein is meant adjusting the value of each 
element of said matrix such that those values that fall in the window considered useful to 
classification (the range of the technique) are placed on the same linear scale as those useful 
values of one or more other similarity matrices used in the calculation. 
One advantage of using an averaging method incorporating a linearisation step for combining
similarity matrices is that the similarity ranges of the experiments do not need to be
comparable.

Another advantage of an averaging method incorporating a linearisation step for combining
similarity matrices is that the discriminatory depth (taxonomic depth or phylogenetic depth) of
the methods used to construct the constituent similarity matrices does not need to be the same.

Another advantage of using an averaging method incorporating a linearisation step for
combining similarity matrices is that the individual similarity matrices do not need to be
complete. The method allows incomplete similarity matrices, resulting from experimental data
matrices with missing experiments, to be combined successfully without distortion of the
composite similarity values towards one of the constituent values.

In one embodiment of the invention, a method for linearising similarity matrices according to
the invention comprises a mathematical description of the problem of linearising matrices and
a solution therefor. One possible mathematical description and solution therefor for
linearising similarity matrices according to the invention is disclosed below.

Attributes of the problem

In one mathematical description of the problem according to the invention, p organisms, samples
or genotypes, and n experimental techniques applied on these organisms, samples or
genotypes are considered. Each technique k results in a matrix of pair-wise distance values,
so that the distance value obtained between organisms i and j from technique k is given
by $d_{k,ij}$. In most cases, these distance values are the result of calculating a mathematical
distance coefficient on the experimental data sets.
Non-limiting examples are:

• Nucleic acid sequences: number of mutations in aligned sequences

• Banding patterns resulting from electrophoresis of DNA restriction fragments:
  number of different bands, distance-converted Jaccard or Dice coefficients

• Enzymatic activity profiles: distance-converted product-moment correlation

• Microarrays: Euclidean distance.
According to the invention, experimental data matrices obtained from the techniques are not
necessarily complete; some experiments may not have been performed for some organisms 
or samples, which results in incomplete distance matrices, i.e. distance matrices with missing 
values. Alternatively, experimental data matrices obtained from the techniques are complete. 

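Mathematical description of the problem

The distance scales from the different experiments are normalized by dividing them by their
corresponding Root-Mean-Square (RMS) values according to Equations [4] and [5] below.

$$ d_{k,ij} \leftarrow \frac{d_{k,ij}}{s_k} \qquad [4] $$

where

$$ s_k = \sqrt{\frac{1}{(p-1)\,p/2} \sum_{i=1}^{p} \sum_{j=1}^{i-1} d_{k,ij}^{2}} \qquad [5] $$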
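
The RMS normalisation of one technique can be sketched as follows. This is a minimal
illustration, assuming the RMS is taken over the known pair-wise values of that technique and
that missing experiments are marked with `None`; the function name `rms_normalise` and the
input values are hypothetical.

```python
import math


def rms_normalise(distances):
    """Divide every known distance of one technique by the RMS of the known values.

    `distances` maps a pair (i, j) to a distance, or to None when the
    experiment is missing; missing values are skipped in the sum.
    """
    known = [d for d in distances.values() if d is not None]
    s_k = math.sqrt(sum(d * d for d in known) / len(known))
    return {pair: (None if d is None else d / s_k) for pair, d in distances.items()}


raw = {(1, 0): 3.0, (2, 0): 4.0, (2, 1): None}  # hypothetical values
scaled = rms_normalise(raw)
print(scaled)
```

After this step the known values of every technique have an RMS of 1, so that techniques
reporting distances on very different numerical scales become comparable.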

The consensus distance matrix $D$ is considered as an unknown "true" universal distance
scale, with all the individual distance scales for each experiment being mapped on that
universal distance scale by some nonlinear function $f_k$ according to equation [6], wherein each
individual experiment k has its own functional dependence (Figure 4).

$$ d_{k,ij} = f_k(D_{ij}), \quad \forall k, i, j \qquad [6] $$

In practical cases, this relationship will not be 100% exact (e.g. due to scatter on the
measurements and because of practical limitations of the experiments).

Equivalently, the consensus distances are connected to the individual distances by nonlinear
function [7], wherein, in the ideal case, $g_k = f_k^{-1}$.

$$ D_{ij} \approx g_k(d_{k,ij}), \quad \forall k, i, j \qquad [7] $$

The goal is to search the functions $f_k$ and $g_k$ so that these relations hold as closely as
possible. Each function $f_k$ holds information about the range of the experiment k.

There are some considerations that put constraints on the functions $f_k$ as follows.

1. Since identical organisms should have zero distance for every possible technique,
$f_k(0) = 0$, and consequently $g_k(0) = 0$.

2. On average, the distance $d_{k,ij}$ between a pair of organisms should increase if the
"true" biological distance increases. However, due to experimental errors, statistical
fluctuation, or imperfections of the technique, this may not be true for particular cases. But, as
an overall trend, the inventors have found that it should hold for every well-designed
comparison experiment. Hence, all functions $f_k$ should increase monotonically. As a direct
consequence, $g_k$ should increase monotonically as well.

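Summarising the mathematical description of the problem, the goal is to search the
consensus distances $D_{ij}$ and the functions $g_k$ and $f_k$ so that $d_{k,ij} = f_k(D_{ij})$ and
$D_{ij} \approx g_k(d_{k,ij})$ hold as true as possible. This can be put in a more exact, least square sense,
by minimizing equations [8] and [9].

$$ \sum_{i=1}^{p} \sum_{j=1}^{i-1} \left( d_{k,ij} - f_k(D_{ij}) \right)^{2}, \quad \forall k \qquad [8] $$

$$ \sum_{i=1}^{p} \sum_{j=1}^{i-1} \left( D_{ij} - g_k(d_{k,ij}) \right)^{2}, \quad \forall k \qquad [9] $$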

There are many equivalent solutions to this problem. This reflects the fact that only relative
differences in distances have a physical interpretation. There is no interpretation of the
consensus distance values themselves. For example, applying any monotonically increasing
function on the consensus distances will yield a new, equivalent solution.

In the equations, we ignored the fact that some distances $d_{k,ij}$ may not be known. However,
the extension is everywhere straightforward: the summations should be modified to exclude
the absent values. We kept the simple forms in order not to overload the notations and
mathematical formulas.



Mathematical solution to the problem 

In order to further parameterize the problem, one can write the functions $f_k$ and $g_k$ as a
linear combination of a number of basis functions according to equations [10] and [11],
wherein m is the number of basis functions used, $B_l$ are the basis functions, and $f_{k,l}$ and $g_{k,l}$
are scalar values that are the unknowns of the problem.

$$ f_k(D) = \sum_{l=1}^{m} f_{k,l} \, B_l(D) \qquad [10] $$

$$ g_k(d) = \sum_{l=1}^{m} g_{k,l} \, B_l(d) \qquad [11] $$

One choice for the basis functions (although not the only possibility) is the set of power
functions according to equation [12]:

$$ B_l(d) = d^{l} \qquad [12] $$

These expansions are subject to four constraints:

1. $f_k(0) = 0$,

2. $g_k(0) = 0$,

3. $f_k$ increases monotonically, and

4. $g_k$ increases monotonically.

One can account for 1. and 2. by only using basis functions that fulfill the same criterion:
$B_l(0) = 0$. Points 3. and 4. are discussed later below.

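The problem is now translated to the minimization of equations [13] and [14], with unknown
values $f_{k,l}$, $g_{k,l}$ and $D_{ij}$.

$$ \sum_{i=1}^{p} \sum_{j=1}^{i-1} \left( d_{k,ij} - \sum_{l=1}^{m} f_{k,l} \, B_l(D_{ij}) \right)^{2}, \quad \forall k \qquad [13] $$

$$ \sum_{i=1}^{p} \sum_{j=1}^{i-1} \left( D_{ij} - \sum_{l=1}^{m} g_{k,l} \, B_l(d_{k,ij}) \right)^{2}, \quad \forall k \qquad [14] $$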

Unfortunately, these equations are not linear in the unknowns: equation [13] contains terms
that are mixed in $f_{k,l}$ and $D_{ij}$. Moreover, $B_l(D_{ij})$ may be a nonlinear function. As a
consequence, one cannot apply the theory of linear least squares optimization to this
problem. One possible solution would be to apply a general nonlinear iterative minimization
algorithm (e.g. Levenberg-Marquardt).

However, in practical cases, it is sufficient to follow a simpler approach, by first minimizing
equation [14] for an optimal solution for $g_{k,l}$ and $D_{ij}$, and then minimizing equation [13] for a
solution for $f_{k,l}$. The inventors have found that this approach yields good results: equations
[13] and [14] are equivalent to a high degree, as they are in fact each other's inverse. A
solution that is optimal for one equation will be almost optimal for the other one.

Although equation [14] is linear in the unknown variables, one has to take special precautions
in order not to arrive at the trivial, perfect but meaningless solution $D_{ij} = 0$ and $g_{k,l} = 0$. One
solution is to follow an iterative approach as follows:



1. Set the consensus distances equal to the averaged individual distances according
to equation [15].

$$ D_{ij} = \frac{1}{n} \sum_{k=1}^{n} d_{k,ij} \qquad [15] $$

2. Minimize equation [14] for the unknown values $g_{k,l}$, keeping $D_{ij}$ fixed. This is a
standard linear least square problem.

3. Minimize equation [14] for the unknown values $D_{ij}$, keeping $g_{k,l}$ fixed.

4. Standardize $D_{ij}$ in some way, e.g. by dividing by the total RMS value.

5. Return to 2. until convergence is achieved.
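
The five steps above can be sketched as follows. This is a simplified illustration under stated
assumptions, not a definitive implementation: all distance matrices are complete and given as
flat arrays of pair-wise values, the power functions are used as basis functions, step 2 uses a
plain least square fit (the monotonicity requirement is not enforced here), and step 3 takes the
mean of the mapped distances over all techniques, which minimises the sum of equation [14]
over k. The names `basis`, `rms` and `consensus_distances` are hypothetical.

```python
import numpy as np


def basis(d, m):
    """Power basis B_l(d) = d**l for l = 1..m, so that B_l(0) = 0."""
    d = np.asarray(d, dtype=float)
    return np.stack([d ** l for l in range(1, m + 1)], axis=-1)


def rms(x):
    return np.sqrt(np.mean(np.square(x)))


def consensus_distances(d, m=3, n_iter=50, tol=1e-10):
    """d: array of shape (n_techniques, n_pairs) holding complete distances."""
    d = np.asarray(d, dtype=float)
    # RMS normalisation of every technique
    d = d / np.sqrt(np.mean(d ** 2, axis=1, keepdims=True))
    # Step 1: start from the averaged individual distances
    D = d.mean(axis=0)
    D = D / rms(D)
    coeffs = []
    for _ in range(n_iter):
        # Step 2: fit the coefficients of every g_k with D fixed (linear least squares)
        coeffs = [np.linalg.lstsq(basis(dk, m), D, rcond=None)[0] for dk in d]
        # Step 3: with g_k fixed, the best D is the mean of g_k(d_k) over the techniques
        D_new = np.mean([basis(dk, m) @ g for dk, g in zip(d, coeffs)], axis=0)
        # Step 4: standardise by the total RMS value
        D_new = D_new / rms(D_new)
        # Step 5: repeat until the consensus distances stop changing
        converged = np.max(np.abs(D_new - D)) < tol
        D = D_new
        if converged:
            break
    return D, coeffs


# Hypothetical data: two techniques seeing the same six pairs on different scales
true_distances = np.array([0.0, 0.2, 0.5, 0.7, 1.0, 1.3])
D, coeffs = consensus_distances(np.vstack([true_distances, true_distances ** 2]))
print(np.round(D, 3))
```

The standardisation in step 4 is what keeps the iteration away from the trivial all-zero
solution: the consensus distances are rescaled to unit RMS after every pass.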



Step 2. needs to take into account the fact that $g_k$ must increase monotonically. This can be
achieved by using a Quadratic Programming technique rather than a simple least square fit.
Basically, one evaluates the first derivative of $g_k$ in a number of fixed distance values $d_i$,
$i = 1, 2, \ldots$, requiring that the first derivative should be non-negative everywhere: $g'_k(d_i) \ge 0$,
$\forall k, i$. For example, the points $d_i$ could be chosen in such a way that they are spread in an
equidistant way over the whole distance range. A sufficient number of points needs to be
used to ensure a consistent non-negative derivative. In each point $d_i$, this condition
translates into a linear inequality in the unknowns $g_{k,l}$ using equation [16], wherein $B'_l$ is the
first derivative of the basis function $B_l$.

$$ \sum_{l=1}^{m} g_{k,l} \, B'_l(d_i) \ge 0, \quad \forall k, i \qquad [16] $$

The minimization of the quadratic form given by equation [14], together with these linear
inequalities, is a standard problem that can be solved by methods known in the art, such as
Quadratic Programming.
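
How the monotonicity requirement turns into linear inequalities can be illustrated with the
power basis. The sketch below only builds the inequality rows and checks a given coefficient
vector against them; it does not solve the Quadratic Programming problem itself. The function
names and the coefficient values are hypothetical.

```python
def derivative_rows(points, m):
    """Row for point d holds B'_l(d) = l * d**(l - 1) for l = 1..m (power basis)."""
    return [[l * d ** (l - 1) for l in range(1, m + 1)] for d in points]


def satisfies_monotonicity(g, points):
    """True if sum_l g[l] * B'_l(d) >= 0 at every checked point d."""
    rows = derivative_rows(points, len(g))
    return all(sum(c * b for c, b in zip(g, row)) >= 0.0 for row in rows)


points = [i / 10 for i in range(11)]  # equidistant points over [0, 1]
increasing = [1.0, 0.5, 0.0]          # g(d) = d + 0.5 d^2
not_increasing = [1.0, -1.0, 0.0]     # g(d) = d - d^2, decreases beyond d = 0.5
print(satisfies_monotonicity(increasing, points))
print(satisfies_monotonicity(not_increasing, points))
```

Each row is linear in the unknown coefficients, which is why the constrained fit remains a
quadratic objective with linear inequality constraints.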

When a solution is obtained for the consensus distances $D_{ij}$, it is still possible to apply any
monotonically increasing function $\sigma(d)$ that has $\sigma(0) = 0$. This does not change anything
fundamental to the solution and has only a cosmetic interpretation.
FIGURES

Figure 1-A, 1-B, 1-C. Example of a conventional cluster analysis based upon 4 organisms
each characterized by four characters of the same type.
Figure 2-A, 2-B. Combining similarity matrices using non-weighted and weighted averaging.
Figure 3-A, 3-B, 3-C. Weighted and unweighted averaging of similarity matrices obtained from
DNA electrophoresis banding patterns.

Figure 4. Figure illustrating the differences in active ranges of results of three experiments,
d1, d2 and d3.

Figure 5A. Comparison between distance matrices from two techniques with different
discriminatory ranges.

Figure 5B. Example of averaging two similarity matrices with different ranges, and containing
missing elements.

Figure 6. XbaI and AvrII patterns for six E. coli strains belonging to different serotypes,
according to Example 1.

Figure 7. UPGMA dendrogram obtained from XbaI patterns from six E. coli strains, according
to Example 1.
Figure 9. UPGMA dendrogram obtained from unweighted average similarities of XbaI and
AvrII patterns from six E. coli strains.

Figure 10. UPGMA dendrogram obtained from weighted average similarities of XbaI and AvrII
patterns from six E. coli strains.
Figure 11. AFLP clustering of Xanthomonas.

Figure 12. DNA hybridization clustering of Xanthomonas.
Figure 13. 16S sequence homology clustering of Xanthomonas.

Figure 14. Consensus clustering of AFLP, DNA hybridization and 16S rRNA sequences of
Xanthomonas based upon unweighted average similarity matrix.
Figure 15. Consensus clustering of AFLP, DNA hybridization and 16S rRNA sequences of
Xanthomonas based upon linearized average similarity matrix.

Figure 16. UPGMA clustering of two known species and one unknown species of genus A.
Figure 17. Histone H3 sequence clustering of different members of a eukaryote genus.
Figure 18. Consensus clustering of fatty acid composition and Histone H3 sequences.


EXAMPLES 

Example 1: Pulsed Field Gel Electrophoresis (PFGE) using different restriction enzymes to
determine pathotype of E. coli O157:H7 strains.

Six pathogenic strains of E. coli O157:H7 are analyzed by means of PFGE using two different
restriction enzymes: XbaI and AvrII. By means of serological tests, the strains have been
assigned to specific serotypes. The results of the experiments are shown in Figure 6.

The similarities between the strains are calculated using the Dice coefficient:

$$ S_{AB} = \frac{2 N_{AB}}{N_A + N_B} $$

where $N_A$ is the total number of bands in pattern A, $N_B$ the total number of bands in pattern B,
and $N_{AB}$ the number of common bands between patterns A and B.
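
As a minimal sketch of this calculation, the function below computes the Dice coefficient as a
percentage from two sets of band positions. It assumes that bands match only when their
positions are identical (in practice a position tolerance is usually applied), and the band
sizes are hypothetical.

```python
def dice_similarity(bands_a, bands_b):
    """Dice coefficient, in percent, between two sets of band positions."""
    bands_a, bands_b = set(bands_a), set(bands_b)
    common = len(bands_a & bands_b)
    return 100.0 * 2 * common / (len(bands_a) + len(bands_b))


pattern_a = {120, 250, 310}       # hypothetical band sizes
pattern_b = {120, 250, 400, 520}
similarity = dice_similarity(pattern_a, pattern_b)
print(round(similarity, 2))
```

Two patterns without any common band give 0%, and two identical patterns give 100%.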
A. PFGE-XbaI

Using the Dice coefficient on the experimental data for PFGE-XbaI depicted in Figure 6
produces the similarity matrix in Table 1.



          STR1     STR2     STR3     STR4     STR5     STR6
STR1    100.00
STR2     66.67   100.00
STR3     72.73    60.01   100.00
STR4     72.73    60.01   100.00   100.00
STR5     57.15    33.33    50.00    50.00   100.00
STR6      0.00    33.33    25.00    25.00     0.00   100.00

Table 1. Similarity matrix of XbaI patterns obtained from six E. coli strains.

Cluster analysis using the unweighted pair group method with arithmetic averages (UPGMA)
results in the dendrogram shown in Figure 7. PFGE-XbaI patterns are able to distinguish A1
serotype strains from the others, as they cluster together. A2 serotype strains, however, could
not be clustered together, as shown by the dendrogram in Figure 7. Furthermore, no
distinction is possible between strains 3 and 4 from serotype A1.
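
The UPGMA step can be sketched in a few lines. The code below is an illustrative
implementation working directly on similarities (not the software used for the figures); it
takes the lower triangle of Table 1 as input and returns the successive merges. The function
name `upgma` and the output format are assumptions.

```python
from itertools import combinations

LABELS = ["STR1", "STR2", "STR3", "STR4", "STR5", "STR6"]
TABLE_1 = [  # lower triangle of Table 1 (XbaI similarities, %)
    [100.00],
    [66.67, 100.00],
    [72.73, 60.01, 100.00],
    [72.73, 60.01, 100.00, 100.00],
    [57.15, 33.33, 50.00, 50.00, 100.00],
    [0.00, 33.33, 25.00, 25.00, 0.00, 100.00],
]


def upgma(sim, labels):
    """Cluster a lower-triangular similarity matrix; return the list of merges."""
    def pair_sim(i, j):
        return sim[max(i, j)][min(i, j)]

    def linkage(a, b):
        # arithmetic mean of all pair-wise similarities between two clusters
        return sum(pair_sim(i, j) for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # join the two clusters with the highest average similarity
        a, b = max(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        merges.append(([labels[i] for i in a], [labels[i] for i in b], linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = upgma(TABLE_1, LABELS)
for left, right, level in merges:
    print(left, "+", right, "at", round(level, 2))
```

The first merge joins STR3 and STR4 at 100%, which reflects the statement that these two
strains cannot be distinguished by their XbaI patterns.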



B. PFGE-AvrII

Using the Dice coefficient on the experimental data for PFGE-AvrII depicted in Figure 6
produces the similarity matrix in Table 2.



          STR1     STR2     STR3     STR4     STR5     STR6
STR1    100.00
STR2     75.01   100.00
STR3     85.72    57.15   100.00
STR4     50.00    75.01    28.57   100.00
STR5     66.67    66.67    50.00    66.67   100.00
STR6     66.67    66.67    50.00    66.67   100.00   100.00

Table 2. Similarity matrix of AvrII patterns obtained from six E. coli strains.
Cluster analysis using the unweighted pair group method with arithmetic averages (UPGMA)
results in the dendrogram shown in Figure 8. PFGE-AvrII patterns are able to distinguish A2
serotype strains from the others, as they cluster together according to the dendrogram in
Figure 8. A1 serotype strains, however, could not be clustered together. Furthermore, no
distinction is possible between the two strains from serotype A2.

C. Combined clustering of XbaI - AvrII by unweighted averaging

In a third analysis, the two similarity matrices (XbaI and AvrII) are combined to create a new
consensus matrix by averaging the corresponding values of the two matrices using
unweighted arithmetic averages. This results in the following composite matrix (Table 3):



          STR1     STR2     STR3     STR4     STR5     STR6
STR1    100.00
STR2     70.84   100.00
STR3     79.23    58.58   100.00
STR4     61.37    67.51    64.29   100.00
STR5     61.91    50.00    50.00    58.33   100.00
STR6     33.33    50.00    37.50    45.83    50.00   100.00

Table 3. Composite similarity matrix obtained by unweighted averaging of XbaI and AvrII
similarities from six E. coli strains.
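
The composite values of Table 3 can be reproduced from Tables 1 and 2 with a short sketch.
The variable and function names are hypothetical; the input values are the lower triangles
of the two tables.

```python
XBA = [  # Table 1, XbaI similarities (%)
    [100.00],
    [66.67, 100.00],
    [72.73, 60.01, 100.00],
    [72.73, 60.01, 100.00, 100.00],
    [57.15, 33.33, 50.00, 50.00, 100.00],
    [0.00, 33.33, 25.00, 25.00, 0.00, 100.00],
]
AVR = [  # Table 2, AvrII similarities (%)
    [100.00],
    [75.01, 100.00],
    [85.72, 57.15, 100.00],
    [50.00, 75.01, 28.57, 100.00],
    [66.67, 66.67, 50.00, 66.67, 100.00],
    [66.67, 66.67, 50.00, 66.67, 100.00, 100.00],
]


def average_matrices(a, b):
    """Element-wise unweighted arithmetic average of two lower-triangular matrices."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


composite = average_matrices(XBA, AVR)
for row in composite:
    print("  ".join(f"{value:6.2f}" for value in row))
```

The entry for STR5 and STR6 is the average of 0% and 100%, i.e. 50%, which is the value
discussed in the next paragraph.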

When a UPGMA dendrogram is calculated from this composite matrix, the consensus
clustering obtained is shown in Figure 9. The linear combination of both techniques is able to
distinguish every strain from every other. However, serotype A1 strains, and in particular
serotype A2 strains, are still not grouped together. When looking more closely at the
similarities between the A2 group strains STR5 and STR6, we find the two bands to be
different in XbaI analysis, whereas all 5 bands are the same in AvrII analysis. This results in
0% and 100% similarity in the respective matrices (see above). The unweighted average
matrix consequently shows 50% similarity between the two strains, and the UPGMA algorithm
does not cluster them together.
D. Combined clustering of XbaI - AvrII by weighted averaging

In a fourth analysis, the two similarity matrices (XbaI and AvrII) are combined to create a new
composite matrix by averaging the corresponding values of the two matrices using weighted
arithmetic averages. The weights are determined such that they compensate for the amount
of information produced by each pair of experiments. The consensus similarity $S_C$ is given
by the formula:

$$ S_C = \frac{N_A S_A + N_B S_B}{N_A + N_B} $$

$N_A$, $N_B$ being the number of characters in experiments A and B, respectively, and $S_A$ and $S_B$
being the similarity in experiments A and B, respectively. Taking weighted averages results in
the composite matrix shown in Table 4:



          STR1     STR2     STR3     STR4     STR5     STR6
STR1    100.00
STR2     70.59   100.00
STR3     77.79    58.83   100.00
STR4     63.17    66.67    73.69   100.00
STR5     62.51    53.33    50.00    58.83   100.00
STR6     37.50    53.33    37.50    47.06    71.43   100.00

Table 4. Composite similarity matrix obtained by weighted averaging of XbaI and AvrII
similarities from six E. coli strains.



When a UPGMA dendrogram is calculated from this composite matrix, the consensus
clustering obtained is that shown in Figure 10. Again, every strain is separated from every
other. In addition, the strains are grouped according to serotype. When looking at the
serotype A2 strains STR5 and STR6, they share a similarity of 71.43%, which is the weighted
average based upon 2 bands in XbaI and 5 bands in AvrII:

$$ \frac{(2 \times 0\%) + (5 \times 100\%)}{2 + 5} = 71.43\% $$
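
The weighted average for a single pair of strains can be checked with a one-line function.
The name `weighted_similarity` and its argument order are assumptions made for this sketch.

```python
def weighted_similarity(s_a, n_a, s_b, n_b):
    """Consensus similarity weighted by the number of characters per experiment."""
    return (n_a * s_a + n_b * s_b) / (n_a + n_b)


# STR5 versus STR6: 2 bands at 0% (XbaI) and 5 bands at 100% (AvrII)
str5_str6 = weighted_similarity(0.0, 2, 100.0, 5)
print(round(str5_str6, 2))  # 71.43
```

With equal numbers of characters in both experiments the function reduces to the unweighted
average used for Table 3.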



PFGE with XbaI only produces two bands per strain, which is far below the minimum to obtain
a reasonably significant measure of similarity. Five bands as produced by AvrII is still a low
number, although more reliable than two. Provided that the experiments are merged with
equal weight per observation (as obtained in formula [4]), a new composite experiment is
created containing 7 observations, which results in the more reliable clustering as obtained.
Example 2: Classification of the genus Xanthomonas based upon 3 techniques: AFLP, DNA
hybridization, and 16S rRNA gene sequencing.

A total number of 29 strains belonging to the genus Xanthomonas, together with the type
strain of its closest neighbor genus, Stenotrophomonas, have been analysed using three
genomic techniques: (i) AFLP (Amplified Fragment Length Polymorphism), an electrophoresis
technique in which some 30-80 bands are selectively amplified and electrophoresed after total
genomic restriction analysis; (ii) DNA hybridization, a technique in which the renaturation rate
of a mixture of equimolar amounts of total genomic DNA from two organisms is measured, to
determine the overall homology between the genomes; (iii) 16S rRNA sequence analysis, in
which a complete or a partial sequence of the 16S ribosomal RNA gene is determined, and
sequences from different organisms compared by homology.

A. AFLP

In Xanthomonas, as in many other bacterial genera, AFLP is able to distinguish between
individual strains, and is suitable for strain typing such as done in epidemiological and strain
authentication studies. Thanks to the large number of bands revealed by the technique, the
range of discrimination extends to the variety, subspecies or even species level. In the
present example of Xanthomonas, strains from the same species are usually grouped
together (except the more heterogeneous species X. axonopodis), whereas within the
species, the technique is able to distinguish between pathovars (see Figure 11). This is
reflected in the cases where strains belong to the same species but constitute different
pathovars (X. oryzae, X. axonopodis, X. hortorum, X. translucens): these strains
consequently have lower similarities with each other than strains that are not classified in
different pathovars (X. fragariae, X. sacchari, X. vesicatoria, X. melonis, X. cassavae,
X. codiaei, X. vasicola pv. holcicola, X. populi, X. cucurbitae, and X. hyacinthi).

However, deeper phylogenies such as the relationships between the species, and in
particular, the relationship with its adjacent genus Stenotrophomonas, are not reflected,
and the technique would falsely suggest that Stenotrophomonas maltophilia is a member of
the genus Xanthomonas.

The obtained similarity range in the AFLP study is between 10.3% and 96.3%.

B. DNA hybridization

DNA hybridization is a technique which typically differentiates at the species level. Different
species can be distinguished from each other by low renaturation rates whereas within the
same species, renaturation rates are usually high. Figure 12 shows a clustering of DNA
renaturation rates between the same strains as used in the AFLP comparison. Unlike AFLP,
the technique is not well suited to distinguish pathovars from one another: most
within-species linkage levels are below the experimental error of the technique, which is known to
be about 6%. Deeper phylogenies, however, are better revealed than by AFLP. This is
reflected by the fact that all species, including X. axonopodis, can be separated from each
other. Furthermore, a deeper group is formed by the species X. translucens, X. hyacinthi, and
X. sacchari, whereas Stenotrophomonas maltophilia is the most distantly clustered strain.

The similarity range of the present DNA hybridization study is between 11.3% and 100%.

C. 16S rRNA sequence comparison

As opposed to most other classification techniques, 16S rRNA sequencing is known to reveal
deep phylogenetic relationships (genus and below). Figure 13 displays a cluster analysis of
aligned 16S rRNA sequences from the same strains as described before. The dendrogram
reveals a strongly different taxonomic structure as compared to AFLP and DNA hybridization.
Within Xanthomonas, three phylogenetic groups can be found, represented by a main core
consisting of most species (A), a separate group formed by the species X. hyacinthi and
X. translucens, affecting monocotyls (B), and a third group formed by X. sacchari (from
sugarcane) (C). The latter two groups were also suggested by DNA hybridization (Figure 12)
but are much more pronounced in the 16S rRNA sequence clustering. Distantly separated from
Xanthomonas is the Stenotrophomonas maltophilia type strain. On the other hand, within- and
between-species relationships are not resolved by 16S rRNA sequencing.

The similarity range of the present 16S rRNA sequence analysis study is between 94.4% and
99.9%.

It should be stressed that there is no conflict between the clustering revealed by AFLP and
DNA hybridization on the one hand, and 16S rRNA sequences on the other hand. What
actually happens is that each technique has its own window in the phylogenetic space, and
grouping analyses such as dendrograms should be looked at within that space. In a
technique such as AFLP, grouping levels below 40-50% should not be looked at, because at
lower similarity levels, the number of incidentally matching bands becomes statistically too
important compared to the number of real matching bands, i.e., those which are identical.
Likewise, DNA hybridization does not provide reliable similarity values below 40% because of
the high experimental error of the technique with genomic DNA samples that are too different.
On the other hand, 16S rRNA sequences do not allow taxonomic distances to be calculated
from one or a few different bases out of 1500. Only when a significant number of bases are
different (e.g. 10 or more) do the calculated distances become statistically significant.

The obvious problem in studies like this is how to obtain one consensus clustering, thereby
respecting the taxonomic level of each technique, and thus preserving all the information
offered by the techniques used. When the similarities of the constitutive matrices are
averaged to obtain an average similarity matrix, the small but very notable differences
obtained by 16S rRNA sequencing are mostly masked by the large similarity fluctuations
obtained by DNA hybridization and, in particular, AFLP. Indeed, the 16S rRNA sequence
similarity ranges between 94.7% and 99.9%, meaning a total span of less than 6% between
the most distant strains. This is much less than the span of the other techniques
(approximately 90%), and is even less than the experimental error of DNA hybridization.
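
The difference in spans can be made explicit with a short calculation. The ranges are those
quoted in this example (using the 94.7% lower bound given in the preceding paragraph); the
"share of the summed spans" is only a crude indicator introduced here for illustration, not a
quantity used by the method.

```python
ranges = {  # similarity ranges (%) quoted in this example
    "AFLP": (10.3, 96.3),
    "DNA hybridization": (11.3, 100.0),
    "16S rRNA": (94.7, 99.9),
}

spans = {name: high - low for name, (low, high) in ranges.items()}
total = sum(spans.values())
for name, span in spans.items():
    print(f"{name}: span {span:.1f}%, share of summed spans {span / total:.1%}")
```

On these figures the 16S rRNA span accounts for only a few percent of the summed spans,
which illustrates why its signal is easily masked in a plain average.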

The logical result is a tree that reflects the species and pathovar classification, but does not
reflect well the deeper phylogenetic structure of the genus as revealed by 16S rRNA
sequences (Figure 14).

An alternative method, using weighted averages, is not applicable to the combination of these
techniques, since AFLP patterns in the present study are compared as densitometric curves
of 2000 values by means of the Pearson product-moment correlation coefficient, whereas
16S rRNA sequences are composed of 1500 bases on average. These two measures are
clearly incomparable in terms of deriving weighted averages. In addition, DNA hybridization
values are undefined in terms of numbers of characters.

A method according to the invention of linearising similarity matrices which have been derived
from different experimental techniques, i.e. different data types, has been applied to the
current data set. The composite similarity matrix resulting from combining said linearised
similarity matrices according to the invention produces the dendrogram shown in Figure 15.
The method successfully combines the information delivered by the described techniques into
a consensus clustering which reflects both the deeper phylogenetic relationships of the
genera and the species, pathovar, and even strain subdivision of the genus Xanthomonas.
This study shows that the invention presents an invaluable tool for investigating the
taxonomic and phylogenetic structure of bacterial taxa, and by extension, of all living
organisms that are analyzed by a combination of techniques that reveal information at
different taxonomic depths. Consequently, the method can also be of great use for
identification purposes, by placing unknown organisms into known classification schemes
thereby respecting the level of information offered by each technique used.
Example 3: Grouping members of a eukaryote genus based on fatty acid cell wall
composition analysis using HPLC and Histone H3 sequencing.

A total number of 15 individuals belonging to a eukaryote genus (taxon not further specified)
are analyzed by their phenotype using HPLC analysis of cell wall fatty acid composition, and
by genotype using Histone H3 sequencing. The members are sampled from different sources:
neutral, alkaline, and acid.

A. Fatty acid cell wall composition

HPLC based Fatty Acid Methyl Ester (FAME) analysis is a very sensitive technique which
usually allows individual organisms to be separated from one another. The obtained fatty acid
methyl ester profiles are very sensitive to environmental influences such as substrate, acidity,
and temperature. The fatty acid profiles are shown as percentages of total fatty acid amount in
Table 5.

Table 5. HPLC fatty acid profiles obtained from 15 members of a eukaryote genus.
When a UPGMA dendrogram is generated based on Euclidean distance calculation between
the fatty acid profiles, the grouping obtained is as shown in Figure 16. Members of the same
species from different sources (neutral, alkaline and acid) are well separated from each other,
which suggests that the technique allows the origin of a sample to be identified based upon
its fatty acid profile. Between the species, however, the separation is not very clear,
especially since A. sp. and A. aberrans do not occur in separate clusters.
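
The Euclidean distance calculation between profiles can be sketched as follows. The member
names and fatty acid percentages are hypothetical (they are not taken from Table 5), and the
function name `distance_matrix` is an assumption.

```python
import math

# Hypothetical fatty acid percentages for three members (not taken from Table 5)
profiles = {
    "member_1": [35.0, 25.0, 40.0],
    "member_2": [32.0, 29.0, 39.0],
    "member_3": [10.0, 60.0, 30.0],
}


def distance_matrix(profiles):
    """Lower-triangular matrix of Euclidean distances between all profiles."""
    names = list(profiles)
    rows = [[math.dist(profiles[a], profiles[b]) for b in names[: i + 1]]
            for i, a in enumerate(names)]
    return names, rows


names, distances = distance_matrix(profiles)
for name, row in zip(names, distances):
    print(name, [round(value, 2) for value in row])
```

The resulting lower-triangular matrix has the same layout as the similarity tables of
Example 1, with zeros on the diagonal because it holds distances rather than similarities.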

B. Histone H3 sequence clustering

Histone H3 sequences of approximately 520 base pairs long were aligned and clustered to
reveal the deeper relationships between the species. The result is a dendrogram depicted in
Figure 17.

The separation of the two known species and the unknown species is very pronounced, and
apparently, within A. sylvestris, there is a phylogenetically aberrant member, which could not
be discovered as such using fatty acid composition analysis. On the other hand, it is not
possible, using Histone H3 sequence analysis, to discriminate between members sampled
from sources with different pH. This is not surprising, as in the short term, environmental
factors will influence the phenotype of the organisms rather than the genotype.

C. Consensus classification of fatty acid composition and Histone H3 sequences

When a consensus matrix is calculated from the individual similarity matrices calculated from
fatty acid composition and Histone H3 sequences, the dendrogram obtained is as shown in
Figure 18.

The species subdivision as suggested by the Histone H3 sequences is preserved in the
clustering, while the phenotypic information as obtained from fatty acid composition analysis
is also reflected in the tree. Interestingly, the phylogenetically aberrant A. sylvestris member
52441 is not clustered along with the other A. sylvestris members from neutral source, but is
placed separately based on its sequence divergence.

CLAIMS

1. A method suitable for producing a consensus classification of organisms using the data
derived from two or more experiments performed on said organisms or samples thereof
comprising the steps of:

i) obtaining similarity matrices from the said data,

ii) producing a composite similarity matrix that is a function of said similarity matrices, and

iii) producing a consensus classification from said composite similarity matrix.


2. A method according to claim 1 wherein the function of step ii) comprises averaging the 
corresponding elements of said similarity matrices. 

3. A method according to claim 2 wherein each similarity matrix is weighted according to the
number of experimental characters used to calculate said matrix, to arrive at the average.

4. A method according to claim 2 wherein each similarity matrix is weighted by a user defined 
value to arrive at the average. 

5. A method according to claim 2, wherein said experiments produce product size or retention
time results, and wherein each element of each similarity matrix is weighted according to
the number of bands or features associated with that element, to arrive at the average.

6. A method according to claim 5 wherein said experiments are any of electrophoresis, high
performance liquid chromatography, gas chromatography, capillary electrophoresis,
chromatography, thin-layer chromatography, and/or mass spectrometry.

7. A method according to claim 1 wherein the function of step ii) comprises the steps of:
a) linearising said similarity data matrices,
b) averaging the corresponding elements of said linearised similarity matrices of step a).

8. A method according to claim 7 wherein step a) comprises the minimisation of equations: 
wherein p is the number of organisms, samples or genotypes, wherein each technique k
results in a matrix of pair-wise distance values, so that the distance value obtained between
organisms i and j from technique k is given by $d_{k,ij}$, wherein $d_{k,ij} \leftarrow d_{k,ij} / s_k$ with
$s_k = \sqrt{\frac{1}{(p-1)\,p/2} \sum_{i=1}^{p} \sum_{j=1}^{i-1} d_{k,ij}^{2}}$, wherein the consensus distance matrix $D$ is considered as the
unknown true universal distance scale and wherein the goal is to search the consensus
distances $D_{ij}$, $g_k$ and $f_k$ so that $d_{k,ij} = f_k(D_{ij})$ and $D_{ij} = g_k(d_{k,ij})$ hold as true as possible.

9. An apparatus suitable for performing the methods according to claims 1 to 8. 

10. A computer program comprising a computing routine, stored on a computer readable
medium, suitable for producing a consensus classification of organisms using the data derived
from two or more experiments performed on said organisms or samples thereof
according to the methods of claims 1 to 8.

11. A device suitable for producing a consensus classification of organisms using the data
derived from two or more experiments performed on said organisms or samples thereof
according to the methods of claims 1 to 8.



ABSTRACT

A METHOD FOR OBTAINING CONSENSUS CLASSIFICATIONS AND IDENTIFICATIONS 
BY COMBINING DATA FROM DIFFERENT EXPERIMENTS 

The present invention relates to methods for producing accurate consensus classifications of 
organisms by combining similarity matrices. It further relates to an apparatus and computer 
program therefor. 
FIGURE 1-A
FIGURE 2-A
FIGURE 2-B
FIGURE 3-A
FIGURE 3-B
FIGURE 4
FIGURE 5A
FIGURE 6
FIGURE 11
FIGURE 12
FIGURE 13
FIGURE 14
FIGURE 15
FIGURE 16
FIGURE 17

Pairwise (OG:100%.UG:0%) (FAST;2.10) Gapcost:0%

FIGURE 18